A Comparison of Document, Sentence, and Term Event Spaces
Abstract
The trend in information retrieval systems is from document retrieval to sub-document retrieval, such as sentences in a summarization system and words or phrases in a question-answering system. Despite this trend, systems continue to model language at the document level using the inverse document frequency (IDF). In this paper, we compare and contrast IDF with inverse sentence frequency (ISF) and inverse term frequency (ITF). A direct comparison reveals that all language models are highly correlated; however, the average ISF and ITF values are 5.5 and 10.4 higher than IDF. All language models appeared to follow a power law distribution, with a slope coefficient of 1.6 for documents and 1.7 for sentences and terms. We conclude with an analysis of IDF stability with respect to random, journal, and section partitions of the 100,830 full-text scientific articles in our experimental corpus.
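As a rough illustration of the three event spaces compared in the abstract, the sketch below computes IDF, ISF, and ITF for a toy corpus. The abstract does not give the formulas, so the definitions here are assumptions: each measure is taken as log2(N / n), where N is the number of documents, sentences, or term occurrences in the corpus, and n is the number of documents, sentences, or occurrences that contain the term. The function name `inverse_frequencies` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def inverse_frequencies(documents):
    """Compute IDF, ISF, and ITF for every term (assumed log2(N / n) form).

    documents: list of documents, each a list of sentences,
               each sentence a list of tokens.
    """
    n_docs = len(documents)
    n_sents = sum(len(doc) for doc in documents)
    n_terms = sum(len(sent) for doc in documents for sent in doc)

    doc_freq, sent_freq, term_freq = Counter(), Counter(), Counter()
    for doc in documents:
        # Count each term once per document.
        doc_freq.update({t for sent in doc for t in sent})
        for sent in doc:
            # Count each term once per sentence.
            sent_freq.update(set(sent))
            # Count every occurrence of the term.
            term_freq.update(sent)

    return {
        t: {
            "idf": math.log2(n_docs / doc_freq[t]),
            "isf": math.log2(n_sents / sent_freq[t]),
            "itf": math.log2(n_terms / term_freq[t]),
        }
        for t in term_freq
    }


# Hypothetical toy corpus: 2 documents, 4 sentences, 12 term occurrences.
corpus = [
    [["the", "cat", "sat"], ["the", "dog", "ran"]],
    [["the", "cat", "ran"], ["a", "bird", "sang"]],
]
scores = inverse_frequencies(corpus)
```

Under these assumed definitions, the event count N grows from documents to sentences to term occurrences, so a given term's ISF and ITF are never lower than its IDF, which is in line with the higher average values the abstract reports.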
Similar resources
A Proficient Apprehension-Based Mining Replica
Most of the frequent techniques in text mining are based on the arithmetic scrutiny of an idiom, either word or slogan. Arithmetical scrutiny of a term incidence captures the consequence of the term within a manuscript only. However, two provisos can have the similar regularity in their documents, but one term contributes more to the connotation of its sentences than the further term. Thus, the ...
Studying the status of science and technology education spaces in management of providing the space and educational equipment for schools in fundamental evolution document
The aim of designing education spaces is to provide the spatial relationships required for education processes. That is, managing the provision of education spaces means determining which spaces and equipment are required in each field, and how and by which discipline they are arranged, so as to serve the final aim of education. The purpose of the study is to evaluate ...
A Document Weighted Approach for Gender and Age Prediction Based on Term Weight Measure
Author profiling is a text classification technique used to predict the profiles of the authors of unknown texts by analyzing their writing styles. Author profiles are characteristics of the authors such as gender, age, native language, country, and educational background. The existing approaches for author profiling suffer from problems like high dimensionality of features and fail to capture th...
Selecting Labels for News Document Clusters
This work deals with the determination of meaningful and terse cluster labels for news document clusters. We analyze a number of alternatives for selecting headlines and/or sentences of documents in a document cluster (obtained as a result of an entity-event-duration query), and formalize an approach to extracting a short phrase from well-supported headlines/sentences of the cluster that can serve a...
Concept-based Mining Model for Web Document Clustering
Most document clustering techniques are based on statistical analysis of a term, either a word or a phrase. The statistical analysis of term frequency captures the importance of the term within the document only. Thus, the underlying mining model should indicate terms that capture the semantics of the text. In this case, the mining model can capture terms that present the concepts of the ...